In [1]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()









    





<matplotlib.figure.Figure at 0x3edbc50>

Loading the training and test set

The first step is to load the training and set that was saved when running the Training and test feature set generation notebook.



In [2]:

    
cd ../../features/









    



/data/opencast/MRes/features



In [4]:

    
posdata = loadtxt("training.HIPPIE.positive.50.Entrez.vectors.txt", delimiter="\t", dtype="str")

Training with the full negative set is impossible on this computer, it is unable to load the file into RAM. To get around this, can use a negative training set only 10 times larger than the positive training set:



In [6]:

    
negdata = loadtxt("training.HIPPIE.negative.Entrez.vectors.txt", delimiter="\t", dtype="str")

There are some conflicting entries in this notebook detected in the training and test set generation. These can be removed using their indexes:



In [9]:

    
import pickle



In [10]:

    
f = open("training.HIPPIE.negative.conflicts.pickle")
conflicts = pickle.load(f)
f.close()



In [23]:

    
negdata = delete(negdata,array(conflicts.keys()),0)



In [24]:

    
X = np.concatenate((posdata,negdata))



In [25]:

    
plottablerows = X=="missing"
plottablerows = where(plottablerows.max(axis=1)==False)

Unfortunately, some "NA" strings have sneaked in somehow, so we will have to zero these. Appears to be due to a feature in Gene_Ontology:



In [26]:

    
NAinds = where(X=="NA")



In [27]:

    
X[NAinds] = 0.0



In [28]:

    
X[X=="missing"] = np.nan
X = X.astype(np.float)

Finally we can create the target vector y from what we know about the lengths of the positive and negative sets:



In [29]:

    
#create the target y vector
y = array([1]*len(posdata)+[0]*len(negdata))

Feature Selection

We can investigate which features are likely to be useful in a number of ways. This is covered in the Scikit-learn documentation.



In [74]:

    
import sklearn.metrics



In [25]:

    
maxes = amax(abs(X),axis=0) + 1e-14



In [33]:

    
mutualinfo = map(lambda x: sklearn.metrics.mutual_info_score(y[plottablerows],X[plottablerows][:,x]/maxes[x]),
                 range(222))



In [34]:

    
plot(mutualinfo)









    Out[34]:





[<matplotlib.lines.Line2D at 0x5166f3690>]

The above graph is very strange. At this point we are just going to have to continue and come back later to try and figure out what's going on with it. Must be something to do with the mutual_info_score method. It is probably not well suited to this problem.

Choosing a classifier

Each classifier will be tested in the following ways over :

Mean accuracy
Combined ROC curves
- AUC values
- Partial AUC values
Combined Precision vs. Recall curves
Informational loss function
Quadratic loss function
Kappa test?

The importance of different features will also be tested through elimination and repeating the tests above.

Finally, a sanity check looking at the model's prediction for some well-known protein interactions is the final test.

Logistic Regression

Logistic regression is a simple classifier which uses a linear transformation of the input variables through a sigmoid function.



In [31]:

    
cd tmp/









    



/data/opencast/MRes/features/tmp



In [79]:

    
import sklearn.linear_model



In [80]:

    
import sys, os
sys.path.append(os.path.abspath("../../opencast-bio/"))



In [81]:

    
import ocbio.model_selection, ocbio.mmap_utils
reload(ocbio.model_selection)
reload(ocbio.mmap_utils)









    Out[81]:





<module 'ocbio.mmap_utils' from '/data/opencast/MRes/opencast-bio/ocbio/mmap_utils.pyc'>



In [17]:

    
!ipcluster stop









    



2014-07-29 21:42:17.974 [IPClusterStop] Using existing profile dir: u'/home/fin/.ipython/profile_default'
2014-07-29 21:42:17.975 [IPClusterStop] CRITICAL | Could not read pid file, cluster is probably not running.



In [18]:

    
!ipcluster start -n=10 --daemon









    



2014-07-29 21:42:20.320 [IPClusterStart] Using existing profile dir: u'/home/fin/.ipython/profile_default'



In [82]:

    
from IPython.parallel import *



In [83]:

    
client = Client(profile='default')



In [84]:

    
#initialise load balanced distributed processing
lb_view = client.load_balanced_view()

Defining pipeline

We have to define a pipeline for the data to include two transformations for the data:

Imputing missing values
Standard scaling of the feature vectors

Both of these are built in operations of Scikit-learn. And they can be easily integrated to a pipeline in Scikit-learn.



In [85]:

    
import sklearn.pipeline, sklearn.preprocessing



In [86]:

    
imputer = sklearn.preprocessing.Imputer(strategy='mean',missing_values="NaN")
scaler = sklearn.preprocessing.StandardScaler()
classifier = sklearn.linear_model.LogisticRegression()
model = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('cls',classifier)])

Plotting a learning curve

We would like to plot a learning curve to get an idea of the required size of our training set to train a model on k-fold validation steps to get valid results. Until we decide on a final classifier it is preferable to be able to run fast iterations of different parameters for more complex parameters than to absolutely maximimise performance. A learning curve plots the performance of the algorithm on the training and test set for different training set sizes.

As with the imported code, much of this code is taken from the excellent parallel machine learning tutorial by Oliver Grisel.



In [87]:

    
trainsizes=logspace(3,5,5)
#initialise and start run
lcsearch = ocbio.model_selection.LearningCurve(lb_view)
lcsearch.launch_for_arrays(model,X,y,trainsizes,n_cv_iter=5,params={'cls__C':0.316}, name="lrlc")









    Out[87]:





Progress: 00% (000/125)



In [99]:

    
print lcsearch









    



Progress: 100% (125/125)



In [100]:

    
lrlc = lcsearch.plot_curve()



In [101]:

    
savez("../../plots/hippie/logistic.regression.learning.curve.npz",lrlc)

This shows that we can test the model on a grid search with only 40,000 to 60,000 samples and still get meaningful results. When the grid search returns a best possible combination of parameters we will have to plot this learning curve again to make sure that the results are still valid. In the mean time, this will speed up our grid search considerably.

Running parallel grid search



In [94]:

    
altclient = Client(profile='altcluster')



In [95]:

    
alb_view = altclient.load_balanced_view()



In [98]:

    
#make sure the processors aren't busy doing anything else:
alb_view.abort()
#initialise parameters to test:
lr_params = {
'cls__C':logspace(-3,1,7)
}



In [97]:

    
#initialise mmaped data -  had to increase CV iterations to reduce output variance
split_filenames = ocbio.mmap_utils.persist_cv_splits(X,y, test_size=10000, train_size=40000,
                                            n_cv_iter=20, name='testfeatures',random_state=42)



In [102]:

    
#initialise and start run
lrsearch = ocbio.model_selection.RandomizedGridSearch(alb_view)
lrsearch.launch_for_splits(model,lr_params,split_filenames)









    Out[102]:





Progress: 00% (000/140)



In [114]:

    
print lrsearch.report()









    



Progress: 100% (140/140)

Rank 1: validation: 0.99254 (+/-0.00021) train: 0.99268 (+/-0.00009):
 {'cls__C': 0.0046415888336127772}
Rank 2: validation: 0.99247 (+/-0.00021) train: 0.99281 (+/-0.00008):
 {'cls__C': 0.021544346900318832}
Rank 3: validation: 0.99245 (+/-0.00021) train: 0.99256 (+/-0.00008):
 {'cls__C': 0.001}
Rank 4: validation: 0.99233 (+/-0.00020) train: 0.99289 (+/-0.00008):
 {'cls__C': 0.10000000000000001}
Rank 5: validation: 0.99227 (+/-0.00020) train: 0.99294 (+/-0.00008):
 {'cls__C': 0.46415888336127775}

With this data only one in every 18 examples will be positive therefore the expected accuracy of a classifier that only predicts zeros is:

$$ 1 - \frac{1}{18} = \frac{17}{18} = 0.9(4) $$

So this model is outperforming the hypothetical all zeros classifier.

We can implement this all zeros classifier and use it to create a comparative measure to make it easier to discriminate between classifiers.



In [91]:

    
import sklearn.dummy



In [92]:

    
def dummycompare(X,y,train,test,best_score):
    dummy = sklearn.dummy.DummyClassifier(strategy="most_frequent")
    dummy.fit(imputer.fit_transform(X[train]),y[train])
    score = dummy.score(imputer.fit_transform(X[test]),y[test])
    if score > best_score:
        return 0.0
    else:
        #using a logit function here on a biased and scaled best score
        best_score = (best_score-score)*(1.0/(1.0-score))
        return best_score,log(best_score/(1.0-best_score))



In [97]:

    
results = np.zeros([5,2])
for i,(train,test) in enumerate(sklearn.cross_validation.StratifiedShuffleSplit(y,n_iter=5,
                                                                  test_size=10000, train_size=40000)):
    results[i,:] = dummycompare(X,y,train,test,lrsearch.find_bests()[0][0])



In [98]:

    
print mean(results,axis=0)









    



[ 0.24375   -1.1322289]

Plotting a ROC curve

A ROC curve indicates the tradeoffs possible with this classifier in terms of true positive versus false positive rate when setting the decision threshold. In some applications the price of a high false positive rate is acceptable to obtain a high true positive rate. As the false positive rate grows to 1.0 though, the true positive rate also grows to 1.0 inevitably for any classifier as all samples are being classified as positive.



In [93]:

    
def plotroc(model,Xtest,ytest):
    estimates = model.predict_proba(Xtest)
    fpr, tpr, thresholds = sklearn.metrics.roc_curve(ytest,estimates[:,1])
    clf()
    plot(fpr, tpr)
    plot([0, 1], [0, 1], 'k--')
    xlim([0.0, 1.0])
    ylim([0.0, 1.0])
    xlabel('False Positive Rate')
    ylabel('True Positive Rate')
    title('Receiver operating characteristic AUC: {0}'.format(sklearn.metrics.roc_auc_score(ytest,estimates[:,1])))
    show()
    return fpr,tpr



In [104]:

    
train,test = next(iter(sklearn.cross_validation.StratifiedShuffleSplit(y,
                                    test_size=100000,train_size=400000)))



In [117]:

    
bestparams = lrsearch.find_bests()[0][-1]



In [118]:

    
%%time
model.set_params(**bestparams)
model.fit(X[train],y[train])
fpr,tpr = plotroc(model,X[test],y[test])









    












    



CPU times: user 2min 13s, sys: 19.3 s, total: 2min 32s
Wall time: 2min 35s

The above graphs shows extremely high true positive rates even at low false positive rates, which is typically very good in a classifier. This is also shown in the extremely high AUC score the classifier is achieving.



In [119]:

    
savez("../../plots/hippie/logistic.regression.roc.npz",fpr,tpr)



In [120]:

    
print dummycompare(X,y,train,test,model.score(X[test],y[test]))









    



(0.85730507191521588, 1.7930848504691246)

Precision recall curve

Precision recall curves plot precision against recall.

Precision is the "fraction of retrieved instances which are relevant"
Recall is the "fraction of relevant instances which are retrieved"

These are implemented in Scikit-learn.



In [197]:

    
def drawprecisionrecall(X,y,test,model):
    try:
        yscore = model.decision_function(X[test])
    except AttributeError:
        yscore = model.predict_proba(X[test])[:,1]
    precision,recall,_ = sklearn.metrics.precision_recall_curve(y[test],yscore)
    average_precision = sklearn.metrics.average_precision_score(y[test], yscore,
                                                         average="micro")
    plt.clf()
    plt.plot(recall,precision)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall curve, AUC: {0:0.2f}'.format(average_precision))
    plt.show()
    return precision, recall



In [121]:

    
precision,recall = drawprecisionrecall(X,y,test,model)

This shows that it is not possible, even with loss of precision, to recall all of the positive training examples. Unfortunately, it is only possible to recall less than 10% with perfect precision.



In [122]:

    
savez("../../plots/hippie/lr_precisionrecall.npz",precision,recall)

Classifier coefficients

We can also plot the weightings which this model is using to achieve the above performance:



In [123]:

    
plot(model.named_steps['cls'].coef_.T)









    Out[123]:





[<matplotlib.lines.Line2D at 0x5c4d9d0>]



In [124]:

    
savez("../../plots/hippie/logistic.regression.coef.npz",model.named_steps['cls'].coef_.T)

Which feature is the largest peak in the above plot?



In [125]:

    
where(model.named_steps['cls'].coef_.T==amax(model.named_steps['cls'].coef_.T))









    Out[125]:





(array([202]), array([0]))

We can see from the notebook on training and test set generation that this corresponds to one of the iRefIndex features. Looking at the notebook on extracting iRefIndex features it appears the 202nd features is the biogrid feature. Given that Biogrid is an extremely large interaction database which is also used in the production of our training set it is not surprising it is a useful feature.

Support Vector Machine

A support vector machine is a kernel-based classifier which aims to fit a hyperplane between training examples to separate the examples into different classes. The algorithm aims to maximise the distance from any given training point to the hyperplane in the high-dimensional space of the training vectors.

Building the pipeline

Optimal performance of a SVM with the default parameters depends on scaling of the data so we must build a pipeline for the model as before with logistic regression:



In [126]:

    
import sklearn.svm



In [127]:

    
imputer = sklearn.preprocessing.Imputer(strategy='mean',missing_values="NaN")
scaler = sklearn.preprocessing.StandardScaler()
classifier = sklearn.svm.SVC()
svmodel = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('cls',classifier)])

Plotting a learning curve

Again, it would be useful to get an idea of acceptable sample sizes prior to running the grid search to save processing time:



In [62]:

    
!ipcluster start -n=10 --profile='altcluster' --daemon









    



2014-07-28 23:28:21.044 [IPClusterStart] Using existing profile dir: u'/home/fin/.ipython/profile_altcluster'



In [181]:

    
altclient = Client(profile='altcluster')



In [182]:

    
alb_view = altclient.load_balanced_view()



In [128]:

    
trainsizes=logspace(3,5,5)
#initialise and start run
lcsearch = ocbio.model_selection.LearningCurve(alb_view)
lcsearch.launch_for_arrays(svmodel,X,y,trainsizes,n_cv_iter=3,params={'cls__C':1.0}, name="lrlc")









    Out[128]:





Progress: 00% (000/075)



In [155]:

    
print lcsearch









    



Progress: 80% (060/075)



In [159]:

    
alb_view.abort()



In [156]:

    
svmlc = lcsearch.plot_curve()



In [158]:

    
savez("../../plots/hippie/svmlc.npz",svmlc)

From this curve we can conclude that at least 20,000 training samples are going to be necessary for meaningful results.



In [160]:

    
#initialise parameters to test:
svm_params = {
'cls__C':logspace(-1,1,3),
'cls__degree':array([1,2,3,4]),
'cls__gamma':logspace(-1,1,3)
}



In [161]:

    
#initialise mmaped data
split_filenames = ocbio.mmap_utils.persist_cv_splits(X,y, test_size=2000, train_size=20000,
                                                     n_cv_iter=3, name='testfeatures',random_state=42)



In [183]:

    
#initialise and start run
search = ocbio.model_selection.RandomizedGridSearch(alb_view)
search.launch_for_splits(svmodel,svm_params,split_filenames)









    Out[183]:





Progress: 00% (000/108)



In [200]:

    
print search









    



Progress: 26% (029/108)

Rank 1: validation: 0.95650 (+/-0.00029) train: 0.99723 (+/-0.00032):
 {'cls__gamma': 0.10000000000000001, 'cls__degree': 3, 'cls__C': 10.0}
Rank 2: validation: 0.95600 (+/-0.00058) train: 0.99685 (+/-0.00026):
 {'cls__gamma': 0.10000000000000001, 'cls__degree': 2, 'cls__C': 1.0}
Rank 3: validation: 0.95400 (+/-0.00087) train: 0.99727 (+/-0.00034):
 {'cls__gamma': 1.0, 'cls__degree': 3, 'cls__C': 10.0}
Rank 4: validation: 0.95383 (+/-0.00101) train: 0.99725 (+/-0.00034):
 {'cls__gamma': 10.0, 'cls__degree': 3, 'cls__C': 1.0}
Rank 5: validation: 0.95383 (+/-0.00101) train: 0.99723 (+/-0.00032):
 {'cls__gamma': 1.0, 'cls__degree': 4, 'cls__C': 1.0}

This is taking an extremely long time to train, and is not feasible in the sample sizes above. Quote from Murphy textbook on SVMs:

SVMs also take $O(N^{3})$ time to train...

Which is also repeated in the documentation for the Scikit-learn SVM classifier.

Due to this problem, it does not seem to be well-suited to this data, where we will primarily be working with very large sample sizes.



In [47]:

    
train,test = next(iter(sklearn.cross_validation.StratifiedShuffleSplit(y,
                                    test_size=2000,train_size=5000)))
print len(train),len(test)



In [ ]:

    
%%time
svmodel.set_params(cls__C=10.0,cls__degree=4,cls__gamma=10.0,cls__probability=True)
svmodel.fit(X[train],y[train])
fpr,tpr = plotroc(svmodel,X[test],y[test])



In [ ]:

    
savez("../../plots/hippie/svm.roc.npz",fpr,tpr)



In [ ]:

    
svmprecision,svmrecall = drawprecisionrecall(X,y,test,svmodel)



In [ ]:

    
savez("../../plots/hippie/svm.precisionrecall.npz",)

Random forest classifier

Random forest classifiers have proven to be effective in the task of protein interaction prediction in the past. A random forest classifier combines the results of many decision tree classifiers. In this way, it can react to correlations in the data in a way that simpler methods, such as logistic regression, cannot.

The combination of decision trees is implemented in Scikit-learn by averaging the results of the individual trees.

Initialising the pipeline

As above we will define the pipeline with the same scaling and imputation:



In [129]:

    
import sklearn.ensemble



In [130]:

    
imputer = sklearn.preprocessing.Imputer(strategy='mean',missing_values="NaN")
scaler = sklearn.preprocessing.StandardScaler()
classifier = sklearn.ensemble.RandomForestClassifier()
rfmodel = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('cls',classifier)])

Plotting a learning curve

As above, we would like to know how quickly this model will learn in a pipeline with the scaling and imputer defined as with previous models before trying to tune hyper-parameters.



In [164]:

    
!ipcluster stop









    



2014-07-30 17:42:46.451 [IPClusterStop] Using existing profile dir: u'/home/fin/.ipython/profile_default'
2014-07-30 17:42:46.528 [IPClusterStop] Stopping cluster [pid=3975] with [signal=2]



In [179]:

    
!ipcluster stop --profile='altcluster'









    



2014-07-30 17:47:09.978 [IPClusterStop] Using existing profile dir: u'/home/fin/.ipython/profile_altcluster'
2014-07-30 17:47:10.029 [IPClusterStop] Stopping cluster [pid=3655] with [signal=2]



In [165]:

    
!ipcluster start --n=10 --daemon









    



2014-07-30 17:42:53.316 [IPClusterStart] Using existing profile dir: u'/home/fin/.ipython/profile_default'



In [180]:

    
!ipcluster start --n=10 --profile='altcluster' --daemon









    



2014-07-30 17:47:14.542 [IPClusterStart] Using existing profile dir: u'/home/fin/.ipython/profile_altcluster'



In [169]:

    
client = Client(profile="default")
lb_view = client.load_balanced_view()
lb_view.abort()



In [ ]:

    
trainsizes=logspace(3,6,7)
#initialise and start run
rflcsearch = ocbio.model_selection.LearningCurve(lb_view)
rflcsearch.launch_for_arrays(rfmodel,X,y,trainsizes,n_cv_iter=3,
                             params={'cls__n_estimators':100,'cls__bootstrap':False},
                             name="lrlc")



In [145]:

    
print rflcsearch









    



Progress: 71% (105/147)



In [146]:

    
rflc = rflcsearch.plot_curve()

Training a single model with 100,000 to see how long each one of these is going to take:



In [148]:

    
savez("../../plots/random.forest.learning.curve.npz",rflc)

Running a grid search

The major parameters to vary with a Random Forest classifier are:

n_estimators - the number of trees in the forest
max_features - size of random subsets of features to consider when splitting nodes

According the to the documentation a good value for max_features is the square root of the number of features, in our case approximately 15. With the number of trees in the forests, the more trees the better the performance will be, but it will take longer to train the model. We will try up to a maximum of 200 estimators to see at which point there are no longer significant gains from increasing the number of estimators.

Also the random trees are using bootstrapping by default when training, which is unnecessary on a dataset as large as this. Therefore, we should set bootstrap=False.



In [138]:

    
split_filenames = ocbio.mmap_utils.persist_cv_splits(X,y, test_size=20000, train_size=100000,
                                                     n_cv_iter=3, name='rffeatures',random_state=42)



In [149]:

    
rfparams = {'cls__n_estimators':100,'cls__bootstrap':False}



In [56]:

    
%%time
train,test = next(iter(sklearn.cross_validation.StratifiedShuffleSplit(y,
                                    test_size=20000,train_size=100000)))
rfmodel.set_params(**rfparams)
rfmodel.fit(X[train],y[train])
print "Training score: {0}".format(model.score(X[train],y[train]))
print "Test score: {0}".format(model.score(X[test],y[test]))









    



Training score: 0.99938
Test score: 0.99935
CPU times: user 2min 14s, sys: 1min 24s, total: 3min 38s
Wall time: 3min 44s



In [172]:

    
rf_params = {
'cls__n_estimators':logspace(1,2.3,5).astype(np.int),
'cls__max_features':arange(5,30,5).astype(np.int),
'cls__bootstrap':[False]
}



In [173]:

    
rfsearch = ocbio.model_selection.RandomizedGridSearch(lb_view)
rfsearch.launch_for_splits(rfmodel,rf_params,split_filenames)









    Out[173]:





Progress: 00% (000/075)



In [187]:

    
print rfsearch









    



Progress: 100% (075/075)

Rank 1: validation: 0.99367 (+/-0.00033) train: 0.99732 (+/-0.00034):
 {'cls__n_estimators': 44, 'cls__max_features': 20, 'cls__bootstrap': False}
Rank 2: validation: 0.99367 (+/-0.00017) train: 0.99732 (+/-0.00034):
 {'cls__n_estimators': 199, 'cls__max_features': 25, 'cls__bootstrap': False}
Rank 3: validation: 0.99367 (+/-0.00017) train: 0.99732 (+/-0.00034):
 {'cls__n_estimators': 44, 'cls__max_features': 25, 'cls__bootstrap': False}
Rank 4: validation: 0.99367 (+/-0.00017) train: 0.99732 (+/-0.00034):
 {'cls__n_estimators': 21, 'cls__max_features': 20, 'cls__bootstrap': False}
Rank 5: validation: 0.99350 (+/-0.00029) train: 0.99732 (+/-0.00034):
 {'cls__n_estimators': 10, 'cls__max_features': 25, 'cls__bootstrap': False}

Training a the model we would like to use on a large training set to get the best possible performance:



In [190]:

    
rfparams = rfsearch.find_bests()[0][-1]



In [191]:

    
%%time
train,test = next(iter(sklearn.cross_validation.StratifiedShuffleSplit(y,
                                    test_size=20000,train_size=100000)))
rfmodel.set_params(**rfparams)
rfmodel.fit(X[train],y[train])
print "Training score: {0}".format(model.score(X[train],y[train]))
print "Test score: {0}".format(model.score(X[test],y[test]))









    



Training score: 0.99275
Test score: 0.9934
CPU times: user 5min 1s, sys: 58.4 s, total: 5min 59s
Wall time: 6min 14s



In [192]:

    
rfroc = plotroc(rfmodel,X[test],y[test])



In [193]:

    
savez("../../plots/random.forest.roc.npz",rfroc)



In [198]:

    
rfprec,rfrecall = drawprecisionrecall(X,y,test,rfmodel)



In [199]:

    
savez("../../plots/hippie/random.forest.precisionrecall.npz")

Extremely Randomized Trees

In Scikit-learn an alternative random forest in which the thresholds chosen for candidate features are randomly selected. This reduces the variance of the model at the cost of an increase in bias. These are called Extremely Randomized Trees.



In [201]:

    
imputer = sklearn.preprocessing.Imputer(strategy='mean',missing_values="NaN")
scaler = sklearn.preprocessing.StandardScaler()
classifier = sklearn.ensemble.ExtraTreesClassifier()
erfmodel = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('cls',classifier)])



In [203]:

    
extrasearch = ocbio.model_selection.RandomizedGridSearch(lb_view)
extrasearch.launch_for_splits(erfmodel,rf_params,split_filenames)









    Out[203]:





Progress: 00% (000/075)



In [205]:

    
print extrasearch









    



Progress: 100% (075/075)

Rank 1: validation: 0.99350 (+/-0.00029) train: 0.99732 (+/-0.00034):
 {'cls__n_estimators': 199, 'cls__max_features': 25, 'cls__bootstrap': False}
Rank 2: validation: 0.99350 (+/-0.00029) train: 0.99732 (+/-0.00034):
 {'cls__n_estimators': 44, 'cls__max_features': 20, 'cls__bootstrap': False}
Rank 3: validation: 0.99350 (+/-0.00029) train: 0.99732 (+/-0.00034):
 {'cls__n_estimators': 44, 'cls__max_features': 15, 'cls__bootstrap': False}
Rank 4: validation: 0.99333 (+/-0.00033) train: 0.99732 (+/-0.00034):
 {'cls__n_estimators': 94, 'cls__max_features': 15, 'cls__bootstrap': False}
Rank 5: validation: 0.99317 (+/-0.00017) train: 0.99732 (+/-0.00034):
 {'cls__n_estimators': 44, 'cls__max_features': 25, 'cls__bootstrap': False}

Feature Importances

We can find the feature importances for a random forest model simply by querying the model:



In [206]:

    
importances = rfmodel.named_steps['cls'].feature_importances_



In [207]:

    
plot(importances)









    Out[207]:





[<matplotlib.lines.Line2D at 0x62f8810>]



In [208]:

    
savez("../../plots/hippie/random.forest.importances.npz",importances)

Training pulldown classifier

As the pulldown results are particularly suited to our classification problem, and have been gathered specifically for us we have to include them in our algorithm. Unfortunately, as described in the the notebook on extracting pulldown features, we were not able to impute the features using regression. The solution is to train the model on a subset of the training data which the pulldown data covers. The indexes of these training examples have been saved in the training and test set generation notebook.



In [64]:

    
posindexes = loadtxt("../hippie.positive.pulldownindexes.txt",dtype=int)
negindexes = loadtxt("../hippie.negative.pulldownindexes.txt",dtype=int)

After removing conflicting entries above the indexes in the negative training set will no longer be correct. For each index we must ensure that it still points to the correct feature vector. We can do this by subtracting the number of conflict indexes that are smaller than each of the these indexes.



In [59]:

    
conflictindexes = array(conflicts.keys())
def fixoffset(negindex):
    Nlower = where(conflictindexes<negindex)[0].shape[0]
    return negindex - Nlower



In [65]:

    
negindexes = map(fixoffset, negindexes)



In [69]:

    
negindexes = array(negindexes)

To index the full X matrix, we will have to offset the index of the negative indexes again to take into account the leading positive entries:



In [70]:

    
negindexes = negindexes + posdata.shape[0]

The ratio of positive to negative examples should be the same as with the full training data above, which is 1:18.



In [216]:

    
float(negindexes.shape[0])/float(posindexes.shape[0])









    Out[216]:





4.168306801736613

Unfortunately, this means we do not have enough negative training examples. A simple solution to this problem is just to include more random negative training examples.



In [225]:

    
reqN = (18-float(negindexes.shape[0])/float(posindexes.shape[0]))*float(posindexes.shape[0])



In [227]:

    
plusneg = random_integers(0,negdata.shape[0],size=int(reqN))



In [228]:

    
if any(a in negindexes for a in plusneg):
    print "overlap"









    



overlap

Classifying Pulldown interactions

The reason we train this classifier is to then attempt estimate probabilities that an interaction exists between proteins in the pulldown experiments, specifically those in the bait and prey lists. Therefore, we will apply this classifier to feature vectors corresponding to combinations of these proteins.

Unfortunately, the file containing these feature vectors is extremely long, it contains many more lines than used for training as there are so many combinations of the proteins involved:



In [38]:

    
!~/qwc.sh pulldown.nolabel.Entrez.vectors.txt









    



1797054 pulldown.nolabel.Entrez.vectors.txt

What we would like to do is iterate over this entire file and produce a file containing two Entrez Gene IDs corresponding to each protein pair with a likelihood estimate in that interaction from the classifier. To limit the number of edges we have to deal with we place a limit on the probability an interaction must have before it is written to the file.

Classifying human genome interactions

We would like to be able to return a likelihood of interaction for any given pair of proteins. Using the model we have trained we can create a short script which is able to calculate this for any given pair of Entrez IDs:



In [218]:

    
cd ../..









    



/data/opencast/MRes



In [220]:

    
assembler = ocbio.extract.FeatureVectorAssembler("datasource.tab")



In [222]:

    
baits,preys = loadtxt("forGAVIN/pulldown_data/BAITS/baits_entrez_ids.csv",dtype=str),loadtxt("forGAVIN/pulldown_data/PREYS/prey_entrez_ids.csv",dtype=str)
pulldownids = set(list(baits)+list(preys))



In [ ]:

    
assembler.getfeaturevector()



In [ ]:



In [ ]:

    
featuresizes = assembler.